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Abstract 

We consider pseudo-random number generators 
suitable for vector processors. In particular, we de- 
scribe vectorised implementations of the Box-Muller 
and Polar methods, and show that they give good per- 
formance on the Fujitsu VP2200. We also consider 
some other popular methods, e.g. the Ratio method and 
the method of Von Neumann and Forsythe, and show 
why they are unlikely to be competitive with the Polar 
method on vector processors. 



1 Introduction 

Several recent papers [2, 4, 16] have considered the 
generation of uniformly distributed pseudo-random 
numbers on vector and parallel computers. In many 
applications, random numbers from specified non- 
uniform distributions are required. These distribu- 
tions may be continuous (e.g. normal, exponential) or 
discrete (e.g. Poisson). A common requirement is for 
the normal distribution. 

The most efficient methods for generating nor- 
mally distributed random variables on sequential ma- 
chines [1, 3, 6, 7, 10, 11] involve the use of different 
approximations on different intervals, and/or the use 
of "rejection" methods, so they often do not vectorise 
well. Simple, "old-fashioned" methods may be prefer- 
able. In Section 2 we describe two such methods, and 
in Sections 3-4 we consider their efficient implementa- 
tion on vector processors, and give the results of imple- 
mentations on a Fujitsu VP2200/10. In Sections 5-6 
we consider some other methods which are popular on 
serial machines, and show that they are unlikely to be 
competitive on vector processors. 
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2 Some Normal Generators 

Assume that a good uniform random number gen- 
erator which returns uniformly distributed numbers 
in the interval [0,1) is available, and that we wish to 
sample the normal distribution with mean fi and vari- 
ance (T^. We can generate two independent, normally 
distributed numbers x, y by the following old algo- 
rithm due to Box and Muller [14] (Algorithm Bl): 

1. Generate independent uniform numbers u and v. 

2. Set r ^ CT^-21n(l - u). 

3. Set a; -s— r sin(27ri;) + /i and y r cos(27rw) + fi. 
Two minor points: 

a. We have written (1 — u) instead of u at step 2 
because of our assumption that the uniform ran- 
dom number generator may return exactly but 
never exactly 1. If the uniform generator never 
returns exactly 0, then (1 — u) can be replaced 
by u. Similar comments apply below. 

b. The argument of sin and cos in step 3 is in the in- 
terval [— TT, tt), but any interval of length 27r would 
be satisfactory. 

The proof that the algorithm is correct follows on 
considering the distribution of {x,y) transformed to 
polar coordinates, and is similar to the proof of cor- 
rectness of the Polar method, given in [10]. 

Algorithm Bl is a reasonable choice if vectorised 
square root, logarithm and trigonometric function rou- 
tines are available. Each normally distributed number 
requires 1 uniformly distributed number, 0.5 square 
roots, 0.5 logarithms, and 1 sin or cos evaluation. Vec- 
torised implementations of the Box-Muller method are 
discussed in Section 3. 

A variation of Algorithm Bl is the Polar method 
of Box, Muller and Marsaglia (1958) {Algorithm P of 
Knuth [10]): 
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1. Generate independent uniform random numbers 
X and y in the interval [—1,1). 

2. Compute s + y^. 

3. If s > 1 then go to step 1 (i.e. reject x and y) else 
go to step 4. 

4. Set r ^ a^/—2ln{s)/s, and return rx + ji and 
ry + n- 

It is easy to see that, at step 4, {x,y) is uniformly 
distributed in the unit circle, so s is uniformly dis- 
tributed in [0,1). To avoid the remote possibility of 
division by zero at step 4, we could replace ln(s)/s by 
ln(l-,s)/(l-,s). 

A proof that the values returned by Algorithm P 
are independent, normally distributed random num- 
bers (with mean and variance a^) is given in 
Knuth [10]. On average, step 1 is executed 4/7r 
times, so each normally distributed number requires 
4/7r ~ 1.27 uniform random numbers, 0.5 divisions, 
0.5 square roots, and 0.5 logarithms. Compared to 
Algorithm Bl, we have avoided the sin and cos com- 
putation at the expense of more uniform random num- 
bers, 0.5 divisions, and the cost of implementing the 
acceptance/rejection process. This can be done using 
a vector gather. Vectorised implementations of the 
Polar method are discussed in Section 4. 

3 Implementation of the Box-Muller 

Method 

We have implemented the Box-Muller method (Al- 
gorithm Bl above) and several refinements (B2, B3) 
on a Fujitsu VP 2200/10 vector processor at the Aus- 
tralian National University. The implementations all 
return double-precision real results, and in eases where 
approximations to sin, cos, sqrt and/or In have been 
made, the absolute error is considerably less than 
10^^*^. Thus, statistical tests using less than about 
10^*^ random numbers should not be able to detect any 
bias due to the approximations. The calling sequences 
allow for an array of random numbers to be returned. 
This permits vectorisation and amortises the cost of a 
subroutine call over the cost of generating many ran- 
dom numbers. 

Our method B2 is the same as Bl, except that we 
replace calls to the Fortran library sin and cos by an 
inline computation, using a fast, but sufficiently ac- 
curate, approximation. Let y = {2ttv — 7r)/16, where 
< t; < 1, so < 7r/16. We approximate siny by a 
polynomial of the form siy + s^y^ + s^y^ + siy"^ , and 
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13.1 
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13.1 
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sin/cos 


13.8 
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6.6 


0.0 
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0.0 


other 


5.9 
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11.6 


11.9 


13.8 


35.1 


total 


41.6 


34.1 


26.3 


33.8 


21.9 


35.4 



Table 1: Cycles per normal random number 



cosy by a polynomial of the form co+C2y^+d^y^+cey^. 
Then, using the identities 

sin 2y = 2 sin y cos y, cos 2y = 1 — 2 sin^ y 

four times, we can compute sin 16y and cos 16y with 
a small number of multiplications and additions. The 
computation is vectorizable. 

Times, in machine cycles per normally distributed 
number, for methods Bl, B2 (and other methods de- 
scribed below) are given in Table 1. In all cases 
the generalised Fibonacci random number genera- 
tor RANU4 (described in [4]) was used to generate 
the required uniform random numbers, and a large 
number of random numbers were generated, so that 
vector lengths were long. RANU4 generates a uni- 
formly distributed random number in 2.2 cycles on 
the VP 2200/10. (The cycle time of the VP 2200/10 
at ANU is 3.2 nsec, and two multiplies and two adds 
can be performed per clock cycle, so the peak speed is 
1.25 Gflop.) 

The Table gives the total times and also the esti- 
mated times for the four main components: 

1. In computation (actually 0.5 times the cost of one 
In computation since the times are per normal 
random number generated). 

2. sqrt computation (actually 0.5 times). 

3. sin or cos computation. 

4. other, including uniform random number genera- 
tion. 

The results for method Bl show that the sin/cos 
and In computations are the most expensive (65% of 
the total time). Method B2 is successful in reducing 
the sin/cos time from 33% of the total to 19%. 

In Method B2, 64% of the time is consumed by 
the computation of y^— ln(l — u). An obvious way to 
reduce this time is to use a fast approximation to the 
function 

f{u) = V-ln(l-«), 

just as we used a fast approximation to sin and cos 
to speed up method Bl. However, this is difficult to 
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accomplish with sufBcicnt accuracy, because the func- 
tion f{u) has singularities at both endpoints of the 
unit interval. Method B3 overcomes this difficulty in 
the following way. 

1. We approximate the function 

V u 

rather than f{u). Using the Taylor series for 
In(l — u), we see that g{u) = 1 + u/4 + • • • is 
well-behaved near u — 0. 

2. The approximation to g{u) is only used in the 
interval < m < r, where r < 1 is suitably cho- 
sen. For T < u < 1 we use the slow but accurate 
Fortran In and sqrt routines. 

3. We make a change of variable of the form v = 
{au + l3)/{'ju + 6), where a,. . . ,6 are chosen to 
map [0,t] to [—1,1], and the remaining degrees 
of freedom are used to move the singularities of 
the function h{v) = g{u) as far away as possible 
from the region of interest (which is— l<t;<l). 
To be more precise, let p be a positive parameter. 
Then we can choose 

and the singularities of h{v) are at ±{p -\- 1). 

For simplicity, we choose p = 1, which experiment 
shows is close to optimal on the VP 2200/10. Then 
r = 8/9, V = (6m — 4)/(4 — Bu), and h{v) has singu- 
larities at V = ±2, corresponding to the singularities 
of g{u) at M = 1 and u ^ co. A polynomial of the 
form /lo + hiv + • • • + hi;,v^^ can be used to approxi- 
mate h{v) with absolute error less than 2 x 10^^^ on 
[—1,1]. About 30 terms would be needed if we at- 
tempted to approximate g{u) to the same accuracy by 
a polynomial on [0,r]. We use polynomial approxi- 
mations which are close to minimax approximations. 
These may easily be obtained by truncating Cheby- 
shev series, as described in [5]. 

It appears that this approach requires the compu- 
tation of a square root, since we really want f{u) — 
u^^'^g{u), not g{u). However, a trick allows this square 
root computation to be avoided, at the expense of an 
additional uniform random number generation (which 
is cheap) and a few arithmetic operations. Recall that 



M is a uniformly distributcxl random variable on [0, 1). 
We generate two independent uniform variables, say 
wi and U2, and let u ■<— max(Mi, tt2)^- It is easy to 
see that u is in fact imiformly distributed on [0,1). 
However, u^/^ = max(Mi,U2) can be computed with- 
out calling the library sqrt routine. To summarise, a 
non-vectorised version of method B3 is: 

1. Generate uniform random numbers Ui, U2 and 
on [0,1). 

2. Set m ^ max(ui, U2) and u ^ rn?. 

3. liu> 8/9 then 

3.1. set r ■«— a ^/^^\ni^^^^u) using Fortran library 
routines, else 

3.2. set V ■<— (6^ — 4)/(4 — 3u), evaluate h{v) as 
described above, and set r ■<— cFmh{v). 

4. Evaluate s ■<— sin(27ru3— tt) and c ■<— cos(27ru3— tt) 
as described above. 

5. Return /x -|- cr\/2 and jj, + sr\/2, which are inde- 
pendent, normal random numbers with mean jj, 
and standard deviation a. 

Vectorization of method B3 is straightforward, and 
can take advantage of the "list vector" technique on 
the VP2200. The idea is to gather those u > 8/9 
into a contiguous array, call the vectorised library rou- 
tines to compute an array of y^— ln(l — u) values, and 
scatter these back. The gather and scatter operations 
do introduce some overhead, as can be seen from the 
row labelled "other" in the Table. Nevertheless, on 
the VP2200, method B3 is about 23% faster than 
method B2, and about 37% faster than the straight- 
forward method Bl. These ratios could be different on 
machines with more (or less) efficient implementations 
of scatter and gather. 

Petersen [16] gives times for normal and uniform 
random number generators on a NEC SX-3. His 
implenic;ntation normalen of the Box-MuUer mcithod 
takes 55.5 nsec per normally distributed mmiber, i.e. it 
is 2.4 times faster than our method Bl, and 1.51 times 
faster than our method B3. The model of SX-3 used 
by Petersen has an effective peak speed of 2.75 Gflop, 
which is 2.2 times the peak speed of the VP 2200/10. 
Considering the relative speeds of the two machines 
and the fact that the SX-3 has a hardware square root 
function, our results are quite encouraging. 
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4 Implementation of the Polcir Method 

The times given in Table 1 for methods B1-B3 can 
be used to predict the best possible performance of the 
Polar method (Section 2). The Polar method avoids 
the computation of sin and cos, so could gain up to 6.6 
cycles per normal random number over method B3. 
However, we would expect the gain to be less than 
this because of the overhead of a vector gather caused 
by use of a rejection method. A straightforward vec- 
torised implementation of the Polar method, called 
method PI, was written to test this prediction. The 
results arc shown in Table 1. 13.8 cyck^s arc^ saved 
by avoiding the sin and cos function evaluations, but 
the overhead increases by 6.0 cycles, giving an over- 
all saving of 7.8 cycles or 19%. Thus, method PI is 
about the same speed as method B2, but not as fast 
as method B3. 

Encouraged by our success in avoiding most In 
and sqrt computations in the Box-Muller method (see 
method B3), we considered if a similar idea would 
work for the Polar method. In fact, it does. Step 
4 of the Polar method (Section 2) involves the compu- 
tation of a/— 21n(,s)/,s, where < s < 1. The function 
has a singularity at s = 0, but we can approximate 
it quite well on an interval such as [1/9,1], using a 
method similar to that used to approximate the func- 
tion g{u) of Section 3. 

Inspection of the proof in Knuth [10] shows that 
step 4 of the Polar method can be replaced by 

4a. Set r a^/-2ln{u)/s, 

and return rx + fi and ry + ^ 

where u is any uniformly distributed variable on (0, 1], 
provided u is independent of arctan(y/x). In particu- 
lar, we can take u = 1 — s. Thus, omitting the constant 
factor a\/2, we need to evaluate ^/^^]n{T^^s)Js, but 
this is just g{s), and we can use exactly the same ap- 
proximation as in Section 3. This gives us method P2. 
To summarise, a non-vectorised version of method P2 
is: 

1. Generate independent uniform random numbers 
X and y in the interval [—1, 1). 

2. Compute s x'^ +y'^. 

3. If s > 1 then go to step 1 (i.e. reject x and y) else 

go to step 4. 

4. If s > 8/9 then 

4.1. set r ^ cr-y/— ln(l — s)/s using Fortran li- 
brary routines, else 



4.2. set V ^ (6s - 4)/(4 - 3s), evaluate h{v) as 
described in Section 3, and set r ah{v). 

5. Return xr\pi + /i and yr\pi + /i, which are inde- 
pendent, normal random numbers with mean /x 
and standard deviation a. 

To vectorise steps 1-3, we simply generate vectors of 
Xj and yj values, compute Sj = x^+yj, and compress 
by omitting any triple {xj,yj,Sj) for which sj > 1. 
This means that we can not predict in advance how 
many normal random numbers will be generated, but 
this problem is easily handled by introdiicing a level 
of buffering. The vectorised version of method P2 is 
called RANN3B, and the user- friendly routine which 
performs the buffering and calls RANN3B is called 
RANN3. 

The second-last column of Table 1 gives results for 

method P2 (actually for RANN3, since the buffering 
overhead is included). There is a saving of 11.9 cycles 
or 35% compared to method PI, and the method is 
17% faster than the fastest version of the Box-Muller 
method (method B3). The cost of logarithm and 
square root computations is only 37% of the total, 
the remainder being the cost of generating uniform 
random numbers (about 13%) and the cost of the re- 
jection step and other overheads (about 50%). On 
the VP2200/10 we can generate more than 14 million 
normally distributed random numbers per second (one 
per 70 nsec). 

5 The Ratio Method 

The Polar method is one of the simplest of a class of 

rejection methods for generating random samples from 
the normal (and other) distributions. Other examples 
are given in [1, 3, 6, 10]. It is possible to implement 
some of these methods in a manner similar to our im- 
plementation of method P2. For example, a popular 
method is the Ratio Method of Kinderman and Mon- 
ahan [9] (also described in [10], and improved in [11]). 
In its simplest form, the Ratio Method is given by 
Algorithm R: 

1. Generate independent uniform random numbers 
u and V in [0, 1). 

2. Set X ^ vWe(w - 5)/(l - u). 

3. If — a;^ ln(l — u) > 4 then go to step 1 (i.e. reject x) 
else go to step 4. 

4. Return ax + ji. 
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Algorithm R returns a normally distributed ran- 
dom number using (on average) %l ^Jl^ ~ 2.74 uniform 
random numbers and 1.37 logarithm evaluations. The 
proof of correctness, and various refinements which re- 
duce the number of logarithm evaluations, are given 
in [9, 10, 11]. The idea of the proof is that x is nor- 
mally distributed if the point (u, v) lies inside a certain 
closed curve C which in turn is inside the rectangle 
[0,1] X \-^/2fe,+^flfe\. Step 3 rejects {u,v) if it is 
outside C. 

The function ln(l — u) occurring at step 3 has a 
singularity at w = 1, but it can be evaluated using a 
polynomial or rational approximation on some interval 
[0,r], where r < 1, in much the same way as the 
function g{u) of Section 3. 

The refinements added by Kinderman and Mon- 
ahan [9] and Leva [11] avoid most of the logarithm 
evaluations. The following step is added: 

2.5. If Pi(w, ii) then go to step 4 
else if P-2, {u, v) then go to step 1 
else go to step 3. 

Here Pi (it, v) and P2(u, v) are easily-computed con- 
ditions. Geometrically, Pi corresponds to a region Pi 
which lies inside C, and P2 corresponds to a region 
R2 which encloses C, but Pi and P2 have almost the 
same area. Step 3 is only executed if {u, v) lies in the 
borderline region P2\Pi. 

Step 2.5 can be vectorised, but at the expense of 
several vector scatter/gather operations. Thus, the 
saving in logarithm evaluations is partly cancelled out 
by an increase in overheads. The last column (Rl) 
of Table 1 gives the times for our implementation on 
the VP2200. As expected, the time for the logarithm 
computation is now negligible, and the overheads dom- 
inate. In percentage terms the times are: 

1% logarithm computation (using the library rou- 
tine), 

17% uniform random number computation, 

23% scatter and gather to handle borderline region, 

59% step 2.5 and other overheads. 

Although disappointing, the result for the Ratio 
method is not surprising, because the computations 
and overheads are similar to those for method P2 
(though with less logarithm computations), but only 
half as many normal random numbers are produced. 
Thus, we would expect the Ratio method to be slightly 
better than half as fast as method P2, and this is what 
Table 1 shows. 



6 GRAND 

On serial machines GRAND [3] is competitive with 
the Ratio method. In fact, GRAND is the fastest of 
the methods compared by Leva [11]. GRAND is based 
on an idea of Von Neumann and Forsythe for generat- 
ing samples from a distribution with density function 
cexp(— /i(.'j;)), where < h{x) < 1: 

1. Generate a uniform random number x, and set 
uo h{x). 

2. Generate independent uniform random numbers 

Ml, "2, • • • 

until the first k > such that Uk-i < Uk- 

3. If k is odd then return x, 
else reject x and go to step 1. 

A proof of correctness is given in Knuth [10]. 

It is hard to see how to implement GRAND effi- 
ciently on a vector processor. There are two prob- 
lems - 

1. A; is not bounded, even though its expected value 
is small. Thus, a sequence of gather operations 
seems to be requirc;(l. The result would be similar 
to Petersen's implementation [16] of a generator 
for the Poisson distribution (much slower than his 
implementation for the normal distribution). 

2. Because of the restriction < h{x) < 1, the area 
under the normal curve exp(— a;^/2)/\/27r has to 
be split into different regions from which samples 
are drawn with probabilities proportional to their 
areas. This complicates the implementation of 
the rejection step. 

For these reasons we would expect a vectorised im- 
plementation of GRAND to be even slower than our 
implementation of the Ratio method. Similar com- 
ments apply to other rejection methods which use an 
iterative rejection process and/or several different re- 
gions. 

7 Conclusion 

Wc have shown that both the Box-MuUcr and Po- 
lar methods vectorise well, and that it is possible to 
avoid and/or speed up the evaluation of the func- 
tions (sin, cos. In, sqrt) which appear necessary. On 
the VP2200/10 our best implementation of the Polar 
method takes 21.9 machine cycles per normal random 
number, slightly faster than our best implementation 
of the Box-MuUer method (26.3 cycles). 
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Wc also considered the vcictorisation of soinc! other 
popular methods for generating normally distributed 
random numbers, such as the Ratio method and the 
method of Von Neumann and Forsythe, and showed 
why such methods are unlikely to be faster than the 
Polar method on a vector processor. 
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